Suppose we collect data for a group of students in a statistics class with variables X1 = hours studied, X2 = undergrad GPA, and Y = receive an A. We fit a logistic regression and produce estimated coefficient, β0 = −6, β1 = 0.05, β2 = 1.
Estimate the probability that a student who studies for 40 h and has an undergrad GPA of 3.5 gets an A in the class.
x1 = 40
x2 = 3.5
b0 = -6
b1 = 0.05
b2 = 1
y <- 1/(1+exp(-(b0+b1*x1+b2*x2)))
cat("The student's probability of receiving an A in the class is ", y)
## The student's probability of receiving an A in the class is 0.3775407
How many hours would the student in part (a) need to study to have a 50 % chance of getting an A in the class?
y = 0.5
x2 = 3.5
b0 = -6
b1 = 0.05
b2 = 1
x1 <- (log(1/y -1) + b0 + b2*x2)/(-b1)
cat("To have a 50% of getting an A, the student would need to study", x1, "hours.")
## To have a 50% of getting an A, the student would need to study 50 hours.
This problem has to do with odds.
On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default?
odds <- 0.37
frac <- 37/137
cat("The fraction of people that will default is", frac)
## The fraction of people that will default is 0.270073
Suppose that an individual has a 16% chance of defaulting on her credit card payment. What are the odds that she will default?
p = 0.16
odds <- p/(1-p)
cat("The odds of that individual defaulting are", odds)
## The odds of that individual defaulting are 0.1904762
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
auto <- ISLR2::Auto
auto$mpg01 <- auto$mpg
med <- median(auto$mpg)
auto$mpg01[auto$mpg01 > med] <- 1
auto$mpg01[auto$mpg01 != 1] <- 0
Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
auto_n <- auto[, c(2:8, 10)]
corrplot(cor(auto_n), col=colorRampPalette(c("hotpink4","white","darkgreen"))(200))
cylKDE<-kde2d(auto$cylinders, auto$mpg01)
filled.contour(cylKDE,color.palette=colorRampPalette(c("white",'blue4',"darkviolet",'deeppink4','orange','yellow')), xlab="cylinders", ylab="mpg01", main="mpg01 vs cylinders")
disKDE<-kde2d(auto$displacement, auto$mpg01)
filled.contour(disKDE,color.palette=colorRampPalette(c("white",'blue4',"darkviolet",'deeppink4','orange','yellow')), xlab="displacement", ylab="mpg01", main="mpg01 vs displacement")
horKDE<-kde2d(auto$horsepower, auto$mpg01)
filled.contour(horKDE,color.palette=colorRampPalette(c("white",'blue4',"darkviolet",'deeppink4','orange','yellow')), xlab="horsepower", ylab="mpg01", main="mpg01 vs horsepower")
weiKDE<-kde2d(auto$weight, auto$mpg01)
filled.contour(weiKDE,color.palette=colorRampPalette(c("white",'blue4',"darkviolet",'deeppink4','orange','yellow')), xlab="weight", ylab="mpg01", main="mpg01 vs weight")
accKDE<-kde2d(auto$acceleration, auto$mpg01)
filled.contour(accKDE,color.palette=colorRampPalette(c("white",'blue4',"darkviolet",'deeppink4','orange','yellow')), xlab="acceleration", ylab="mpg01", main="mpg01 vs acceleration")
yeaKDE<-kde2d(auto$year, auto$mpg01)
filled.contour(yeaKDE,color.palette=colorRampPalette(c("white",'blue4',"darkviolet",'deeppink4','orange','yellow')), xlab="year", ylab="mpg01", main="mpg01 vs year")
oriKDE<-kde2d(auto$origin, auto$mpg01)
filled.contour(oriKDE,color.palette=colorRampPalette(c("white",'blue4',"darkviolet",'deeppink4','orange','yellow')), xlab="origin", ylab="mpg01", main="mpg01 vs origin")
Using the above Kernel-Density plots, the following trends appear to be present in distribution:
Cylinder: lower cylinders tend to have an mpg above the median, higher cylinders tend to have an mpg below the median.
Displacement: lower displacement has mpgs both above and below the median, with more above, higher displacement tends to have an mpg below the median.
Horsepower: lower horsepower has mpgs both above and below the median, with more above, higher horsepower tends to have an mpg below the median.
Weight: lower weight tend to have an mpg above the median, with a few below the median, higher weight tends to have an mpg below the median.
Acceleration and year have no clearly conclusive trends. More recent years may hold some correlation towards higher mpg but it is not consistent. Same to acceleration in positively correlated trends but no consistency.
Origin: Also not a super clear trend, origins of a value of “1” have both types of mpgs, origins with a “2” value also have both but have more above the median, origins with a value of “3” have values above the median.
Split the data into a training set and a test set.
library(caret)
## Loading required package: lattice
library(ggplot2)
#cylinders, displacement, horsepower, weight
split_pct <- 0.6
n <- length(auto$mpg01)*split_pct # train size
row_samp <- sample(1:length(auto$mpg01), n, replace = FALSE)
train <- auto[row_samp,]
test <- auto[-row_samp,]
(ISLR2 part F) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
auto_train_mod <- glm(data = train, mpg01 ~ cylinders + displacement + horsepower + weight, family = binomial)
auto_train_mod
##
## Call: glm(formula = mpg01 ~ cylinders + displacement + horsepower +
## weight, family = binomial, data = train)
##
## Coefficients:
## (Intercept) cylinders displacement horsepower weight
## 12.571451 0.226131 -0.021497 -0.039633 -0.002185
##
## Degrees of Freedom: 234 Total (i.e. Null); 230 Residual
## Null Deviance: 325.7
## Residual Deviance: 107.2 AIC: 117.2
Use the parameters found in the logistic regression to compute the predictions directly for the test dataset. Compare to the output of the predict function.
print("for cylinders = 4, displacement = 121, horsepower = 110, and weight = 3615")
## [1] "for cylinders = 4, displacement = 121, horsepower = 110, and weight = 3615"
x1 <- 4
x2 <- 121
x3 <- 110
x4 <- 3615
auto_predict_test <- data.frame(cylinders = x1, displacement = x2, horsepower = x3, weight = x4)
auto_test_pred <- 1/(1+exp(-predict(auto_train_mod,auto_predict_test, type = "response")))
b0 <- auto_train_mod$coefficients[1]
b1 <- auto_train_mod$coefficients[2]
b2 <- auto_train_mod$coefficients[3]
b3 <- auto_train_mod$coefficients[4]
b4 <- auto_train_mod$coefficients[5]
y <- 1/(1+exp(-(b0+b1*x1+b2*x2+b3*x3+b4*x4)))
cat("The predict function had an output of", auto_test_pred, "and the parameters created an output of", y)
## The predict function had an output of 0.5498835 and the parameters created an output of 0.2002
Create a confusion matrix for the train and test datasets. Does the model perform similarly between them? Be sure to explain.
test_pred <- predict(auto_train_mod,test, type = "response")
train_cm <- confusionMatrix(as.factor(as.integer(2*auto_train_mod$fitted.values)), reference = as.factor(train$mpg01))
test_cm <- confusionMatrix(as.factor(as.integer(2*test_pred)), reference = as.factor(test$mpg01))
train_cm$table
## Reference
## Prediction 0 1
## 0 108 11
## 1 12 104
test_cm$table
## Reference
## Prediction 0 1
## 0 61 5
## 1 15 76
print("The model for training and testing performed similarly, as only 14/157 (8.9%) values for the testing data were falsely positive/negative, and only 24/235 (10.2%) for the training data were falsely positive/negative. The proportion of falsely predicted values is actually higher in the training data, but only by about 1.3%.")
## [1] "The model for training and testing performed similarly, as only 14/157 (8.9%) values for the testing data were falsely positive/negative, and only 24/235 (10.2%) for the training data were falsely positive/negative. The proportion of falsely predicted values is actually higher in the training data, but only by about 1.3%."
Compute confidence intervals for the parameters based on the z values.
coeff1 <- rep(0,1000)
coeff2 <- rep(0,1000)
coeff3 <- rep(0,1000)
coeff4 <- rep(0,1000)
n <- length(auto$mpg01)
for (i in 1:1000){
temp_mod <- glm(data = train, mpg01 ~ cylinders + displacement + horsepower + weight, family = binomial)
coeff1[i] <- temp_mod$coefficients[2]
coeff2[i] <- temp_mod$coefficients[3]
coeff3[i] <- temp_mod$coefficients[4]
coeff4[i] <- temp_mod$coefficients[5]
}
quantile(coeff1, c(0.025,0.975))
## 2.5% 97.5%
## 0.2261312 0.2261312
quantile(coeff2, c(0.025,0.975))
## 2.5% 97.5%
## -0.02149707 -0.02149707
quantile(coeff3, c(0.025,0.975))
## 2.5% 97.5%
## -0.03963336 -0.03963336
quantile(coeff4, c(0.025,0.975))
## 2.5% 97.5%
## -0.002185396 -0.002185396
Compute bootstrapped confidence intervals for the parameters you found.
coeff1 <- rep(0, 1000)
coeff2 <- rep(0, 1000)
coeff3 <- rep(0, 1000)
coeff4 <- rep(0, 1000)
n <- nrow(auto)
for(i in 1:1000){
row_samp <- sample(1:n, replace = TRUE)
auto_samp <- auto[row_samp,]
temp_mod <- glm(data = train, mpg01 ~ cylinders + displacement + horsepower + weight, family = binomial)
coeff1[i] <- temp_mod$coefficients[2]
coeff2[i] <- temp_mod$coefficients[3]
coeff3[i] <- temp_mod$coefficients[4]
coeff4[i] <- temp_mod$coefficients[5]
}
quantile(coeff1, c(0.025, 0.975))
## 2.5% 97.5%
## 0.2261312 0.2261312
quantile(coeff2, c(0.025, 0.975))
## 2.5% 97.5%
## -0.02149707 -0.02149707
quantile(coeff3, c(0.025, 0.975))
## 2.5% 97.5%
## -0.03963336 -0.03963336
quantile(coeff4, c(0.025, 0.975))
## 2.5% 97.5%
## -0.002185396 -0.002185396
Re-do the logistic regression with 2 numerical variables (no train-test split). Create a contour plot of the prediction results.
auto_train_mod2 <- glm(data = auto, mpg01 ~ displacement + horsepower, family = binomial)
auto_cont_pred <- predict(auto_train_mod2,auto, type = "response")
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
fig <- plot_ly(x = auto$displacement, y = auto$horsepower, z = auto_cont_pred, type = "contour")
fig
For the customer churn dataset, consider the fields Age, Total_Purchase, Account_Manager, Years, and Num_Sites as possible X variables. Note that Account_Manager is a binary categorical variable.
Create histograms to examine how each variable might predict churn.
churn <- read.csv("data/customer_churn.csv")
par(mfrow=c(2,3))
hist(churn$Age)
hist(churn$Total_Purchase)
hist(churn$Account_Manager)
hist(churn$Years)
hist(churn$Num_Sites)
Split the data into train and test datasets.
split_pct <- 0.7
n <- length(churn$Churn)*split_pct # train size
row_samp <- sample(1:length(churn$Churn), n, replace = FALSE)
train <- churn[row_samp,]
test <- churn[-row_samp,]
Fit a logistic regression model—first with all X’s, and then remove those X’s that are not statistically significant, one at a time. Create a confusion matrix for the final model for both the train and test datasets, and compare the results.
churn_train_mod <- glm(data = train, Churn ~ Years + Num_Sites, family = binomial)
test_pred <- predict(churn_train_mod,test, type = "response")
train_cm <- confusionMatrix(as.factor(as.integer(2*churn_train_mod$fitted.values)), reference = as.factor(train$Churn))
test_cm <- confusionMatrix(as.factor(as.integer(2*test_pred)), reference = as.factor(test$Churn))
train_cm$table
## Reference
## Prediction 0 1
## 0 514 50
## 1 18 48
test_cm$table
## Reference
## Prediction 0 1
## 0 208 25
## 1 10 27